Domain-Specific LLM Embedding Accuracy Evaluation

The two models used in this experiment are BERT and BioBERT
  • BERT is a general-purpose language model pre-trained on a large corpus of general text, while BioBERT is a version of BERT that is further pre-trained on a large biomedical corpus.
  • BERT's pre-training data is general text: BooksCorpus and English Wikipedia.
  • BioBERT's additional pre-training data is biomedical literature: PubMed abstracts and, for some releases, PMC full-text articles.
In [ ]:
%pip install -q transformers torch matplotlib scipy prettytable
Python interpreter will be restarted.
In [ ]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
from prettytable import PrettyTable
import textwrap

# ------------------------------ Configuration ------------------------------
TABLE_WRAP_WIDTH = 50  # Set this width for better viewing on smaller screens
# ---------------------------------------------------------------------------

# Function to create embeddings using a specified model
def create_embedding(text, model_name):
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
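    # Use the [CLS] token's final hidden state as the sentence embedding.
    # Flatten (1, hidden_size) to 1-D, since recent SciPy versions require 1-D input to cosine().
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()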
    return cls_embedding

# Function to compute cosine similarity between two embeddings
# (used in this notebook as a proxy for "accuracy"; it is not a classification accuracy)
def calculate_accuracy(embedding1, embedding2):
    similarity = 1 - cosine(embedding1, embedding2)
    return similarity

# Function to plot the accuracies of both models
def plot_accuracy(bert_accuracy, biobert_accuracy):
    plt.figure(figsize=(8, 6))
    plt.title('RAG Accuracy Comparison between BERT and BioBERT')
    plt.bar(['BERT', 'BioBERT'], [bert_accuracy, biobert_accuracy], color=['blue', 'orange'])
    plt.ylabel('Cosine Similarity (Accuracy)')
    plt.ylim(0, 1)
    plt.axhline(y=0.5, color='gray', linestyle='--', label='Reference line (0.5)')
    plt.legend()
    plt.grid()
    plt.show()

# Helper to wrap long text
def wrap_text(text, width=TABLE_WRAP_WIDTH):
    return '\n'.join(textwrap.wrap(text, width=width))

# Test sentences: sentence1 is general business text, sentence2 is biomedical text
sentence1 = "Manage and organize feedback from various languages in international operations."
sentence2 = "The association between poor glycemic control, oxidative stress, insulin resistance, and of low-grade inflammation have been suggested as putative factors linking diabetes and cardiovascular disease."

# Create embeddings using both BERT and BioBERT
bert_embedding1 = create_embedding(sentence1, 'bert-base-uncased')
bert_embedding2 = create_embedding(sentence2, 'bert-base-uncased')
biobert_embedding1 = create_embedding(sentence1, 'dmis-lab/biobert-v1.1')
biobert_embedding2 = create_embedding(sentence2, 'dmis-lab/biobert-v1.1')

# Calculate accuracies
bert_accuracy = calculate_accuracy(bert_embedding1, bert_embedding2)
biobert_accuracy = calculate_accuracy(biobert_embedding1, biobert_embedding2)

# Prepare table display
table = PrettyTable()
table.field_names = ["Model", "Sentence 1", "Sentence 2", "Cosine Similarity"]
table.align = "l"

table.add_row([
    "BERT",
    wrap_text(sentence1),
    wrap_text(sentence2),
    f"{bert_accuracy:.4f}"
])
table.add_row([
    "BioBERT",
    wrap_text(sentence1),
    wrap_text(sentence2),
    f"{biobert_accuracy:.4f}"
])

# Print table
print("\n--- Similarity Accuracy ---\n")
print(table)

# Plot results
plot_accuracy(bert_accuracy, biobert_accuracy)
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

--- Similarity Accuracy ---

+---------+-------------------------------------------+----------------------------------------------------+-------------------+
| Model   | Sentence 1                                | Sentence 2                                         | Cosine Similarity |
+---------+-------------------------------------------+----------------------------------------------------+-------------------+
| BERT    | Manage and organize feedback from various | The association between poor glycemic control,     | 0.5931            |
|         | languages in international operations.    | oxidative stress, insulin resistance, and of low-  |                   |
|         |                                           | grade inflammation have been suggested as putative |                   |
|         |                                           | factors linking diabetes and cardiovascular        |                   |
|         |                                           | disease.                                           |                   |
| BioBERT | Manage and organize feedback from various | The association between poor glycemic control,     | 0.7665            |
|         | languages in international operations.    | oxidative stress, insulin resistance, and of low-  |                   |
|         |                                           | grade inflammation have been suggested as putative |                   |
|         |                                           | factors linking diabetes and cardiovascular        |                   |
|         |                                           | disease.                                           |                   |
+---------+-------------------------------------------+----------------------------------------------------+-------------------+
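[Figure: bar chart "RAG Accuracy Comparison between BERT and BioBERT"; y-axis "Cosine Similarity (Accuracy)"; bars for BERT and BioBERT, with a dashed horizontal line at 0.5]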
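
The "Cosine Similarity" column in the table above is computed as 1 minus SciPy's cosine distance. As a rough illustration of what that number measures, here is a minimal NumPy sketch of the same quantity. The helper name `cosine_similarity` and the toy vectors are assumptions made for this example; they are not part of the notebook and are not model output.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; inputs are flattened to 1-D first."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors with illustrative values (hypothetical, not real embeddings).
# Shape (1, 3) mimics a batch-of-one [CLS] embedding of shape (1, hidden_size).
a = np.array([[1.0, 0.0, 1.0]])
b = np.array([[1.0, 1.0, 0.0]])

# dot(a, b) = 1 and each norm is sqrt(2), so the result is approximately 0.5
print(cosine_similarity(a, b))
```

Because the value depends only on the angle between the two vectors, it says how closely one model places the two sentences in its own embedding space. Scores from BERT and BioBERT come from different spaces, so comparing them on a single sentence pair is suggestive at best.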